In this section we’ll continue using the colorectal cancer (CRC) dataset. Besides the subgroup information, we’ll explore the other sample annotations.
CRC <- read.csv("./data/CRC_train.csv")
str(CRC[,73:79])
## 'data.frame': 200 obs. of 7 variables:
## $ Sample : Factor w/ 200 levels "P1A1","P1A10",..: 2 4 5 7 11 12 13 16 18 21 ...
## $ Group : Factor w/ 2 levels "CRC","Healthy": 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 60 70 65 65 62 55 61 52 89 81 ...
## $ Gender : Factor w/ 2 levels "female","male": 1 2 2 1 1 2 2 2 1 2 ...
## $ Cancer_stage : int 1 1 1 4 3 2 2 4 4 1 ...
## $ Tumour_location: Factor w/ 2 levels "colon","rectum": 1 2 2 1 1 1 2 2 1 2 ...
## $ Sub_group : Factor w/ 3 levels "Benign","CRC",..: 2 2 2 2 2 2 2 2 2 2 ...
# Inspect the possible values for the annotation columns
unique(CRC[, 'Cancer_stage'])
## [1] 1 4 3 2 NA
unique(CRC[, 'Tumour_location'])
## [1] colon rectum <NA>
## Levels: colon rectum
unique(CRC[, 'Sub_group'])
## [1] CRC Benign Healthy
## Levels: Benign CRC Healthy
geom_bar() makes the height of the bar proportional to the number of cases in each group.
library(ggplot2)
# Count the samples in each Sub_group
g1 <- ggplot(CRC, aes(Sub_group))
g1 + geom_bar()
# Horizontal barplot
g1 + geom_bar() +
  coord_flip()
We can also count the patients in each group ourselves and then draw the bar plot with stat = "identity", which uses the supplied y values as the bar heights instead of counting rows.
library(dplyr)
# count the number of samples in each sub group
n_subgroup <- CRC %>% group_by(Sub_group) %>% summarise(N = n())
n_subgroup
## # A tibble: 3 × 2
## Sub_group N
## <fctr> <int>
## 1 Benign 34
## 2 CRC 100
## 3 Healthy 66
ggplot(n_subgroup, aes(x = Sub_group, y = N)) +
  geom_bar(stat = "identity")
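As a side note, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"). The sketch below rebuilds the count table by hand from the numbers printed above (the name `counts` is only for illustration) and draws the same bars both ways.

```r
library(ggplot2)

# Counts copied from the n_subgroup table printed above;
# `counts` is an illustrative name, not part of the original analysis
counts <- data.frame(
  Sub_group = c("Benign", "CRC", "Healthy"),
  N = c(34, 100, 66)
)

# geom_bar(stat = "identity") uses the y values as bar heights
p_bar <- ggplot(counts, aes(x = Sub_group, y = N)) +
  geom_bar(stat = "identity")

# geom_col() does the same without needing the stat argument
p_col <- ggplot(counts, aes(x = Sub_group, y = N)) +
  geom_col()

p_col
```

Both plots should look identical, so the choice is mostly a matter of readability.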
To see how many patients of each gender are in each subgroup, we map Gender to the fill aesthetic.
g2 <- ggplot(CRC, aes(Sub_group, fill=factor(Gender)))
# Stacked bars: overlapping bars are placed on top of each other
g2 + geom_bar(position = "stack")
# Dodged bars: overlapping bars are placed side by side
g2 + geom_bar(position = "dodge")
# Filled bars: each stack is scaled to the same height to show proportions
g2 + geom_bar(position = "fill")
A pie chart is a stacked bar chart drawn in polar coordinates with coord_polar().
ggplot(data = CRC) +
  geom_bar(mapping = aes(x = factor(1), fill = factor(Sub_group))) +
  coord_polar(theta = "y")
We want to investigate the abundance distribution of one protein. geom_histogram() shows the distribution of a single variable.
g3 <- ggplot(CRC, aes(SERPINA3))
g3 + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Frequency polygons show the distribution of a single variable with lines.
g3 + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
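The message above is a reminder that the default of 30 bins is rarely the best choice. Below is a minimal sketch of setting the bins explicitly. It uses simulated values in place of a real protein column, so the data frame `sim` and its column `abundance` are made up for illustration.

```r
library(ggplot2)

# Simulated abundances standing in for a protein column such as SERPINA3
set.seed(1)
sim <- data.frame(abundance = rnorm(200, mean = 20, sd = 2))

h <- ggplot(sim, aes(abundance))

# Either fix the number of bins ...
p_bins <- h + geom_histogram(bins = 20)

# ... or fix the width of each bin, in the units of the variable
p_width <- h + geom_histogram(binwidth = 0.5)

p_width
```

Setting either argument also silences the message. binwidth is often easier to reason about because it is expressed in the units of the data.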
Challenge: If you want to see the distribution of protein AFM across all observations, which plot would you make?
g <- ggplot(CRC, aes(AFM))
g + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
g + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
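Since the challenge asks about density, another option worth knowing is geom_density(), which draws a smoothed kernel density estimate instead of binned counts. The sketch below uses simulated values, so the data frame `sim` is hypothetical and only its column name mirrors the AFM column of the real data.

```r
library(ggplot2)

# Simulated values standing in for the AFM column
set.seed(2)
sim <- data.frame(AFM = rnorm(200, mean = 18, sd = 1.5))

# Kernel density estimate; fill and alpha are purely cosmetic
p_density <- ggplot(sim, aes(AFM)) +
  geom_density(fill = "grey80", alpha = 0.6)

p_density
```

With the real data, the same layer could be added to the `g` object defined above, for example `g + geom_density()`.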
We want to investigate the relationship between two proteins. geom_point() creates scatterplots, which are most useful for displaying the relationship between two continuous variables.
g <- ggplot(CRC, aes(x = SERPINA3, y = TIMP1))
g + geom_point()
If you have a scatterplot with a lot of noise, it can be hard to see the dominant pattern. In this case it’s useful to add a smoothed line to the plot with geom_smooth().
g + geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
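As the message shows, geom_smooth() chose a loess curve here. If a straight-line fit is wanted instead, the method argument can be set to "lm". This is a sketch on simulated data, with the hypothetical columns `protein_a` and `protein_b` standing in for two protein abundances.

```r
library(ggplot2)

# Simulated, roughly linear relationship between two made-up proteins
set.seed(3)
sim <- data.frame(protein_a = rnorm(100, mean = 20, sd = 2))
sim$protein_b <- 0.5 * sim$protein_a + rnorm(100, sd = 0.8)

# method = "lm" fits a linear regression; the grey band is its confidence interval
p_lm <- ggplot(sim, aes(x = protein_a, y = protein_b)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_lm
```

Passing se = FALSE to geom_smooth() would hide the confidence band if only the line is of interest.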
smoothScatter(), from base R graphics rather than ggplot2, produces a smoothed color density representation of a scatterplot.
smoothScatter(CRC$SERPINA3, CRC$TIMP1)
Challenge: If you want to see the relationship between the abundance of protein AFM and protein AHSG, which plot would you make?
ggplot(CRC, aes(x = AFM, y = AHSG)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
geom_text() adds text labels to a plot.
g1 <- ggplot(CRC, aes(x = Sample, y = TIMP1))
g1 + geom_text(aes(label = Age))
Setting check_overlap = TRUE drops any label that would overlap a label already drawn in the same layer.
g1 + geom_text(aes(label = Age), check_overlap = TRUE)
We can also add points to the plot.
g1 + geom_text(aes(label = Age), check_overlap = TRUE) +
  geom_point()
nudge_x and nudge_y offset the text from its data position, for example so that a label does not cover its point.
g1 + geom_text(aes(label = Age), nudge_y = 0.05, check_overlap = TRUE) +
  geom_point(shape = 1, aes(color = Sub_group))
ggrepel implements functions to repel overlapping text labels away from each other and away from the data points that they label.
library(ggrepel)
g1 + geom_point(shape = 1, aes(color = Sub_group)) +
  # geom_text_repel() handles overlaps itself, so check_overlap is not needed
  geom_text_repel(aes(label = Age))
We can also add labels to a bar plot.
# Start from the per-subgroup counts
ggplot(n_subgroup, aes(x = Sub_group, y = N)) +
  # Draw the bars
  geom_bar(stat = "identity") +
  # Print each count in white just inside the top of its bar
  geom_text(aes(label = N), vjust = 1.5, colour = "white")
We want to investigate how protein abundance varies across different groups. geom_boxplot() summarises the shape of the distribution with a handful of summary statistics.
b <- ggplot(CRC, aes(Sub_group, SERPINA3))
b + geom_boxplot()
We can also overlay multiple geoms by simply adding them one after the other. Layers are drawn in the order they are added, so the last geom sits on top.
b + geom_boxplot() + geom_jitter()
b + geom_jitter() + geom_boxplot()
b + geom_jitter(alpha = 0.5) + geom_boxplot()
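One thing to watch for when combining these two geoms is that outliers are drawn twice: once by geom_boxplot() and once, at a jittered position, by geom_jitter(). A common option, sketched here on simulated data (the data frame `sim` and its `abundance` column are made up), is to hide the boxplot outliers with outlier.shape = NA and to jitter only horizontally so the plotted values stay exact.

```r
library(ggplot2)

# Simulated abundances for three groups of 40 samples each
set.seed(4)
sim <- data.frame(
  Sub_group = rep(c("Benign", "CRC", "Healthy"), each = 40),
  abundance = rnorm(120, mean = 20, sd = 2)
)

p_box <- ggplot(sim, aes(Sub_group, abundance)) +
  # Hide the outlier points of the boxplot; geom_jitter() draws every point anyway
  geom_boxplot(outlier.shape = NA) +
  # height = 0 keeps the y values untouched; width controls the horizontal spread
  geom_jitter(width = 0.2, height = 0, alpha = 0.5)

p_box
```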
To take another factor, gender, into account, we map it to the fill aesthetic.
b + geom_boxplot(aes(fill= factor(Gender)), position = "dodge")
A violin plot is a mirrored density plot displayed in the same way as a boxplot.
b + geom_violin()
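We want to know how the protein abundance changes across different samples. Because Sample is a factor, group = 1 is needed to tell geom_line() to connect all points as a single series.
ggplot(CRC[1:100, ], aes(Sample, SERPINA3, group = 1)) +
  geom_point(size = 0.5) +
  geom_line()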
Challenge: Which plot would you make to compare the abundance distribution of protein AFM among the different groups? What would you change to also see the difference between male and female patients within each group? How would you make the boxplots horizontal? Finally, add the data points to the boxplot.
# Compare AFM abundance among the groups
ggplot(CRC, aes(Sub_group, AFM)) +
  geom_boxplot()
# Split each group by gender
ggplot(CRC, aes(Sub_group, AFM, fill = factor(Gender))) +
  geom_boxplot(position = "dodge")
# Make the boxplots horizontal
ggplot(CRC, aes(Sub_group, AFM, fill = factor(Gender))) +
  geom_boxplot(position = "dodge") +
  coord_flip()
# Add the data points
ggplot(CRC, aes(Sub_group, AFM, fill = factor(Gender))) +
  geom_boxplot(position = "dodge") +
  geom_jitter()
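In the last plot the boxes are dodged by gender, but geom_jitter() spreads the points around the centre of each group, so points from both genders mix across the two boxes. One possible refinement, sketched here on simulated data (the data frame `sim` is made up and only mimics the column names of the real data), is position_jitterdodge(), which jitters the points within their own dodged box.

```r
library(ggplot2)

# Simulated data: 3 groups x 2 genders, 20 samples per combination
set.seed(5)
sim <- data.frame(
  Sub_group = rep(c("Benign", "CRC", "Healthy"), each = 40),
  Gender = rep(c("female", "male"), times = 60),
  AFM = rnorm(120, mean = 18, sd = 1.5)
)

p_dodge <- ggplot(sim, aes(Sub_group, AFM, fill = Gender)) +
  # Outliers are hidden because every point is drawn by the next layer
  geom_boxplot(outlier.shape = NA) +
  # shape = 21 is a filled circle, so the points pick up the fill colour;
  # dodge.width = 0.75 matches the default dodge width of the boxes
  geom_point(
    shape = 21,
    position = position_jitterdodge(jitter.width = 0.15, dodge.width = 0.75)
  )

p_dodge
```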